{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## 样例数据生成笔记\n",
        "\nFaker 是一个强大的伪数据生成库"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import random\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "from faker import Faker # 注意大写\n",
        "from random import choice"
      ],
      "outputs": [],
      "execution_count": 1,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 本地化\n",
        "\n根据需求可以在实例化时指定一个地区，从而改变随机生成的数据内容（比如人名、地理位置等）"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "fake = Faker('zh-CN')"
      ],
      "outputs": [],
      "execution_count": 2,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### UA\n",
        "\n值得单独一提的是随机生成伪造 User-Agent，放入循环中不断生成即可用于爬虫伪装，甚至可以指定各类型浏览器，"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# 直接随机\n",
        "fake.user_agent()"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 3,
          "data": {
            "text/plain": [
              "'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/5352 (KHTML, like Gecko) Chrome/44.0.859.0 Safari/5352'"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 3,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 指定浏览器，可进一步指定参数范围\n",
        "print('# chrome')\n",
        "print(fake.chrome(version_from=13, version_to=63, build_from=800, build_to=899))\n",
        "print('# firefox')\n",
        "print(fake.firefox())\n",
        "print('# safari')\n",
        "print(fake.safari())\n",
        "print('# ie')\n",
        "print(fake.internet_explorer())\n",
        "print('# opera')\n",
        "print(fake.opera())"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "# chrome\n",
            "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5361 (KHTML, like Gecko) Chrome/16.0.829.0 Safari/5361\n",
            "# firefox\n",
            "Mozilla/5.0 (Windows NT 5.0; fi-FI; rv:1.9.1.20) Gecko/2014-04-23 03:27:37 Firefox/3.8\n",
            "# safari\n",
            "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_8_5 rv:3.0; csb-PL) AppleWebKit/531.28.6 (KHTML, like Gecko) Version/5.1 Safari/531.28.6\n",
            "# ie\n",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.1)\n",
            "# opera\n",
            "Opera/8.19.(Windows CE; hak-TW) Presto/2.9.179 Version/12.00\n"
          ]
        }
      ],
      "execution_count": 4,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 结合 Pandas\n",
        "\n",
        "根据各个字段的不同需求，生成一张完全的伪造数据表\n",
        "\n参考了[Stack Overflow上的问题](https://stackoverflow.com/questions/45574191/using-python-faker-generate-different-data-for-5000-rows?r=SearchResults)"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "def create_rows(num=1):\n",
        "    output = [{\"name\":fake.name(),\n",
        "               \"email\":fake.email(),\n",
        "               \"bs\":fake.bs(),\n",
        "               \"boolean\":fake.boolean(),\n",
        "               \"address\":fake.address(),\n",
        "               \"province\":fake.province(),\n",
        "               \"city\":fake.city(),\n",
        "               \"date_time\":fake.date_time_between(start_date='-2y',end_date='now'), # 参数不支持一般字符类型日期格式，要指定具体日期时需要用 start_date=datetime.date(year=2019,month=1,day=1) 这样\n",
        "               \"system\":random_system(), # 跨平台需要自己写一个，暂时还没有手机系统\n",
        "               \"random_int\":fake.random_int(10,20)} for x in range(num)]\n",
        "    return output\n",
        "\n",
        "def random_system():\n",
        "    sys_list = [fake.linux_platform_token(), fake.mac_platform_token(), fake.windows_platform_token()]\n",
        "    sys = random.choice(sys_list)\n",
        "    return sys"
      ],
      "outputs": [],
      "execution_count": 5,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df = pd.DataFrame(create_rows(500))\n",
        "df.head()"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 6,
          "data": {
            "text/plain": [
              "                     address  boolean                                   bs  \\\n",
              "0         河南省飞县山亭年路p座 194876    False      deliver value-added experiences   \n",
              "1       青海省阜新市沈北新邱路B座 506004     True        deliver efficient experiences   \n",
              "2  广西壮族自治区六盘水市孝南佛山街K座 512375     True     visualize plug-and-play channels   \n",
              "3       黑龍江省拉萨市永川贾路L座 781193     True  streamline visionary info-mediaries   \n",
              "4        河南省兰州县房山蒋路W座 553582    False            brand distributed markets   \n",
              "\n",
              "   city           date_time                email name province  random_int  \\\n",
              "0    萍市 2018-05-27 12:17:17     shenxiulan@vs.cn  司秀荣      浙江省          10   \n",
              "1  六盘水县 2018-12-24 11:09:32     mayang@gmail.com   汲磊      福建省          16   \n",
              "2    坤县 2017-02-07 09:20:34  ming69@juanwang.org  蔺秀梅      山西省          11   \n",
              "3   银川县 2018-05-29 18:27:00    maxia@hotmail.com   经龙      湖南省          16   \n",
              "4   香港县 2017-12-26 07:44:11     vqian@minzeng.cn  暴桂香      江西省          19   \n",
              "\n",
              "                                 system  \n",
              "0     Macintosh; Intel Mac OS X 10_11_9  \n",
              "1      Macintosh; Intel Mac OS X 10_7_8  \n",
              "2  Macintosh; U; Intel Mac OS X 10_12_7  \n",
              "3                       X11; Linux i686  \n",
              "4                     X11; Linux x86_64  "
            ],
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>address</th>\n",
              "      <th>boolean</th>\n",
              "      <th>bs</th>\n",
              "      <th>city</th>\n",
              "      <th>date_time</th>\n",
              "      <th>email</th>\n",
              "      <th>name</th>\n",
              "      <th>province</th>\n",
              "      <th>random_int</th>\n",
              "      <th>system</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>河南省飞县山亭年路p座 194876</td>\n",
              "      <td>False</td>\n",
              "      <td>deliver value-added experiences</td>\n",
              "      <td>萍市</td>\n",
              "      <td>2018-05-27 12:17:17</td>\n",
              "      <td>shenxiulan@vs.cn</td>\n",
              "      <td>司秀荣</td>\n",
              "      <td>浙江省</td>\n",
              "      <td>10</td>\n",
              "      <td>Macintosh; Intel Mac OS X 10_11_9</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>青海省阜新市沈北新邱路B座 506004</td>\n",
              "      <td>True</td>\n",
              "      <td>deliver efficient experiences</td>\n",
              "      <td>六盘水县</td>\n",
              "      <td>2018-12-24 11:09:32</td>\n",
              "      <td>mayang@gmail.com</td>\n",
              "      <td>汲磊</td>\n",
              "      <td>福建省</td>\n",
              "      <td>16</td>\n",
              "      <td>Macintosh; Intel Mac OS X 10_7_8</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>广西壮族自治区六盘水市孝南佛山街K座 512375</td>\n",
              "      <td>True</td>\n",
              "      <td>visualize plug-and-play channels</td>\n",
              "      <td>坤县</td>\n",
              "      <td>2017-02-07 09:20:34</td>\n",
              "      <td>ming69@juanwang.org</td>\n",
              "      <td>蔺秀梅</td>\n",
              "      <td>山西省</td>\n",
              "      <td>11</td>\n",
              "      <td>Macintosh; U; Intel Mac OS X 10_12_7</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>黑龍江省拉萨市永川贾路L座 781193</td>\n",
              "      <td>True</td>\n",
              "      <td>streamline visionary info-mediaries</td>\n",
              "      <td>银川县</td>\n",
              "      <td>2018-05-29 18:27:00</td>\n",
              "      <td>maxia@hotmail.com</td>\n",
              "      <td>经龙</td>\n",
              "      <td>湖南省</td>\n",
              "      <td>16</td>\n",
              "      <td>X11; Linux i686</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>河南省兰州县房山蒋路W座 553582</td>\n",
              "      <td>False</td>\n",
              "      <td>brand distributed markets</td>\n",
              "      <td>香港县</td>\n",
              "      <td>2017-12-26 07:44:11</td>\n",
              "      <td>vqian@minzeng.cn</td>\n",
              "      <td>暴桂香</td>\n",
              "      <td>江西省</td>\n",
              "      <td>19</td>\n",
              "      <td>X11; Linux x86_64</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 6,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    }
  ],
  "metadata": {
    "kernel_info": {
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.5",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "nteract": {
      "version": "0.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}